Practical Implementations of Principal Component Analysis

Nick Belgau, Oscar Hernandez Mata

2024-08-01

The Math behind PCA

SVD is faster and more accurate than eigen-decomposition

Although PCA is traditionally taught through eigen-decomposition of the covariance matrix, Singular Value Decomposition (SVD) is typically used in practice.

  • Numerical stability: No need to form the covariance matrix (proportional to \(X^TX\)), a step that squares the condition number of \(X\) and can amplify rounding errors; SVD is more robust for ill-conditioned matrices.
  • Efficient with large datasets: Decomposes the data matrix directly, without first computing the covariance matrix (Johnson and Wichern 2023).
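
As a rough illustration of the stability argument, the sketch below compares the condition number of a centered data matrix with that of its cross-product \(X^TX\). The simulated, nearly collinear variables and every object name here are assumptions made for the demo, not part of the original analysis.

```r
# Simulated, nearly collinear data (hypothetical example)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 1e-4)
X  <- scale(cbind(x1, x2), center = TRUE, scale = FALSE)

# Condition number = largest / smallest singular value
cond <- function(M) {
  d <- svd(M)$d
  max(d) / min(d)
}
cond_X   <- cond(X)
cond_XtX <- cond(crossprod(X))  # approximately cond_X^2

# When the data are not too ill-conditioned, both routes agree on the PC variances
var_svd   <- svd(X)$d^2 / (n - 1)
var_eigen <- eigen(cov(X), symmetric = TRUE)$values
```

Because the cross-product roughly squares the condition number, the eigen-decomposition route loses precision sooner than SVD as the variables approach collinearity.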

sklearn.decomposition.PCA

stats::prcomp()

SVD decomposition
Ensure features are continuous and standardized (\(\mu\) = 0, \(\sigma\) = 1), then decompose:
\[ X = U \Sigma V^T \]
Each column of \(V\) represents a principal component (PC). The PCs are orthogonal to each other and form the new axes of maximum variance.
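
A minimal sketch of this decomposition in base R, assuming simulated data (the matrix `X_raw` and its column names are hypothetical). It checks that the columns of \(V\) are orthonormal, that \(U \Sigma V^T\) reproduces \(X\), and that `stats::prcomp()` returns the same loadings up to sign.

```r
# Hypothetical data: three numeric features, two of them correlated
set.seed(42)
X_raw <- cbind(a = rnorm(100), b = rnorm(100, mean = 5, sd = 3), c = runif(100))
X_raw[, "b"] <- X_raw[, "b"] + 2 * X_raw[, "a"]

X <- scale(X_raw)  # mean 0, standard deviation 1 per column
s <- svd(X)        # X = U diag(d) V^T
V <- s$v           # columns are the principal component directions

# Columns of V are orthonormal: V^T V = I
ortho_err <- max(abs(crossprod(V) - diag(ncol(V))))

# The three factors reproduce the standardized data
recon_err <- max(abs(s$u %*% diag(s$d) %*% t(V) - X))

# prcomp() gives the same loadings; the sign of each column is arbitrary
pca <- prcomp(X_raw, center = TRUE, scale. = TRUE)
loading_diff <- max(abs(abs(pca$rotation) - abs(V)))
```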

Calculate explained variance by PC
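The diagonal singular value matrix \(\Sigma\) holds the singular values \(\sigma_i\) (not to be confused with the standard deviation \(\sigma\) above), which correspond to the strength of each PC:
\[ \text{variance explained}_i = \frac{\sigma_i^2}{\sum_j \sigma_j^2} \]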
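
The formula can be sketched directly from the singular values. The built-in `USArrests` dataset is used here purely as a stand-in (an assumption, not the dataset from this presentation), and the result is compared against the standard deviations reported by `prcomp()`.

```r
# Stand-in data: built-in USArrests (not the dataset used in the application)
X <- scale(USArrests)

d <- svd(X)$d                    # singular values sigma_i
var_explained <- d^2 / sum(d^2)  # share of total variance per PC
cum_explained <- cumsum(var_explained)

# Cross-check: prcomp() stores sqrt(sigma_i^2 / (n - 1)) in $sdev
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)
var_explained_ref <- pca$sdev^2 / sum(pca$sdev^2)
```

The \(n - 1\) factor cancels in the ratio, so the two versions should agree.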

Dimensionality reduction:
- Select PCs: Keep the smallest number of PCs whose cumulative explained variance reaches a target (e.g., 95%).
- Truncate \(V\): Select top PCs to reduce dimensions and transform \(X\) into a new feature space.
\[ X_{\text{transformed}} = X V_{\text{selected}} \]
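
The two steps can be sketched as follows, again using the built-in `USArrests` data as a stand-in. The 0.95 threshold mirrors the target mentioned above; the object names are illustrative only.

```r
# Stand-in data: built-in USArrests (not the dataset used in the application)
X <- scale(USArrests)
s <- svd(X)

# Step 1: smallest k whose cumulative explained variance reaches the target
target  <- 0.95
cum_var <- cumsum(s$d^2) / sum(s$d^2)
k <- which(cum_var >= target)[1]

# Step 2: truncate V and project the data onto the selected PCs
V_selected    <- s$v[, seq_len(k), drop = FALSE]
X_transformed <- X %*% V_selected  # n rows, k columns
```

`drop = FALSE` keeps `V_selected` a matrix even when a single PC is selected, so the matrix product still works.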

The effectiveness of PCA relies on satisfying the following assumptions.

  1. Linearity
    PCA assumes that the resulting principal components are linear combinations of the original variables. Nonlinear relationships can produce low covariance values, which leads PCA to underrepresent their significance.

  2. Continuous data
    PCA begins by standardizing the data, so the features should come from a continuous distribution. Because the scale of measurement conveys information about the variance, categorical variables should be handled separately.

  3. Data standardization

  • Scaling standardizes the variance of each variable to ensure equal contributions.
  • Mean-centering is similarly important: it ensures that the principal components capture the true direction of maximum variance.
  • Outliers can also distort the PCs, so they should be identified and handled appropriately.
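
To illustrate why scaling matters, the sketch below runs `prcomp()` with and without scaling on two simulated variables measured on very different scales. The variable names and distributions are assumptions chosen for the demo.

```r
# Hypothetical data: two unrelated variables on very different scales
set.seed(7)
n <- 500
X <- cbind(
  small_scale = rnorm(n, mean = 0, sd = 1),
  large_scale = rnorm(n, mean = 5000, sd = 1000)
)

# Without scaling, PC1 is dominated by the large-variance variable
unscaled <- prcomp(X, center = TRUE, scale. = FALSE)
load_unscaled <- abs(unscaled$rotation[, 1])

# With scaling, both variables contribute equally to PC1
scaled <- prcomp(X, center = TRUE, scale. = TRUE)
load_scaled <- abs(scaled$rotation[, 1])
```

With only two standardized variables, the PC1 loadings are equal in magnitude. Without scaling, the unit of measurement alone decides which variable drives the first component.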

Application 1 - Demographic Data

Dataset Description

This application was inspired by a paper from UWF researchers on Alzheimer’s disease mortality (Amin, Yacko, and Guttmann 2018); see also Tejada-Vera (2013).

The dataset is derived from US census data and contains demographic, health, and environmental metrics for counties in the United States. The dataset was filtered to select counties in the Deep South.

Column
Obesity Age Adj
Smoking Rate
Diabetes
Heart Disease
Cancer
Food Index
Poverty Percent
Physical Inactivity
Mercury TPY
Lead TPY
Atrazine High KG

Check Assumptions: Continuous Variables

  • The variables appear to be continuous because the data types are “numeric” with high cardinality.
  • There are no nulls.
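  • The variables span very different scales (e.g., Mercury_TPY has a mean of 0.02 while SUNLIGHT has a mean of 17689.04), so scaling and mean-centering will be needed.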
skim(data)
Data summary
Name data
Number of rows 1143
Number of columns 11
_______________________
Column type frequency:
numeric 11
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
obesity_age_adj 0 1 31.65 3.76 19.00 29.16 31.26 33.95 46.92 ▁▆▇▂▁
Smoking_Rate 0 1 25.06 3.56 10.72 22.94 25.62 27.68 32.86 ▁▁▃▇▂
Diabetes 0 1 10.64 1.62 6.48 9.32 10.56 11.69 17.92 ▂▇▆▁▁
Heart_Disease 0 1 126.68 38.46 41.20 99.90 120.10 146.85 279.20 ▂▇▃▁▁
Cancer 0 1 187.78 26.55 75.33 170.25 188.26 204.40 370.64 ▁▇▆▁▁
Mercury_TPY 0 1 0.02 0.07 0.00 0.00 0.00 0.00 0.94 ▇▁▁▁▁
Lead_TPY 0 1 0.14 0.30 0.00 0.01 0.04 0.14 2.80 ▇▁▁▁▁
Food_index 0 1 6.60 1.30 0.00 5.90 6.70 7.40 10.00 ▁▁▃▇▁
Poverty_Percent 0 1 19.21 6.67 0.00 14.90 18.80 23.15 47.70 ▁▇▇▁▁
Atrazine_High_KG 0 1 4531.16 24239.04 0.00 94.50 632.80 3480.10 768660.60 ▇▁▁▁▁
SUNLIGHT 0 1 17689.04 1037.82 15389.96 16897.70 17723.25 18285.74 21671.87 ▃▇▆▂▁

Check Assumptions: Linearity Analysis

Visually checking scatter plots for every pair of variables is not a realistic method for inspecting linearity in real-world applications, and correlation plots are insufficient.

The Harvey-Collier test can automate the check by running a pairwise test for linearity. It applies a t-test to the recursive residuals of a linear fit: if their mean differs significantly from zero, as happens when the true relationship is convex or concave, the p-value will be less than alpha (Harvey and Collier 1977; Maureen, Oyinebifun, and Christopher 2022).
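
A small sketch of why correlation alone can miss nonlinearity, assuming a deliberately simple simulated relationship (not taken from the dataset): a variable that is fully determined by another can still show a correlation of essentially zero, while the residuals of a straight-line fit expose the curvature.

```r
# Hypothetical example: y is completely determined by x, but not linearly
x <- seq(-3, 3, length.out = 201)
y <- x^2

# Pearson correlation is essentially zero because x is symmetric around 0
r <- cor(x, y)

# Residuals from a straight-line fit still carry the full quadratic pattern
fit <- lm(y ~ x)
curvature_signal <- cor(residuals(fit), x^2)
```

This is the kind of structure a residual-based test is designed to detect and a correlation matrix is not.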

Code
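library(lmtest)  # provides harvtest()

# Fit y ~ x and return the Harvey-Collier p-value (2 significant digits).
# harvtest() uses the row order of the data unless order.by is supplied.
harvey_collier_test <- function(data, x, y) {
  formula <- as.formula(paste(y, "~", x))
  model <- lm(formula, data = data)
  test <- harvtest(model)
  p_value <- test$p.value
  return(signif(p_value, digits = 2))
}

# Matrix of pairwise p-values: row = predictor, column = response
variables <- names(data)
n <- length(variables)
p_matrix <- matrix(NA, n, n, dimnames = list(variables, variables))

for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    if (i != j) {  # Avoid testing a variable against itself
      p_matrix[i, j] <- harvey_collier_test(data, variables[i], variables[j])
    }
  }
}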

Check Assumptions: Linearity Analysis (continued)

  • The residual plot for a single pair that was flagged as nonlinear reveals why the Harvey-Collier Test declared nonlinearity.
  • Transformations may risk altering other variable relationships, so no transformations were applied, acknowledging some information loss in PCA.
Code
library(ggplot2)

data_residual <- as.data.frame(data)

# Residuals from the straight-line fit of Diabetes on obesity_age_adj
data_residual$residuals <- residuals(lm(Diabetes ~ obesity_age_adj, data = data_residual))

# A smooth curve that departs from the zero line indicates nonlinearity
ggplot(data_residual, aes(x = obesity_age_adj, y = residuals)) +
  geom_point() + 
  geom_smooth(method = "loess", se = FALSE, color = "blue") +  # LOESS curve
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals of Diabetes vs. obesity_age_adj",
       x = "obesity_age_adj", y = "Residuals") +
  theme_minimal()

Check Assumptions: Outliers Analysis

Note: mean-centering and scaling are handled within the PCA implementation.

  • Outliers can distort PCA results by disproportionately increasing variance, shifting the direction of principal components, and inflating eigenvalues.
  • Although PCA does not require normality, a roughly normal distribution reduces the impact of outliers.
  • Non-normal data may undergo transformations such as Box-Cox to approximate normality.
  • Assessing skewness and kurtosis offers practical insights into distribution characteristics.
                   Kurtosis   Skewness
Atrazine_High_KG 866.377802 27.6792372
Mercury_TPY       72.724213  7.4402444
Lead_TPY          34.286296  5.0314443
Cancer             5.489565  0.3562768
Food_index         4.878495 -0.8643685
Heart_Disease      3.867908  0.8405355
Poverty_Percent    3.795263  0.4674652
obesity_age_adj    3.680773  0.2879323
Smoking_Rate       3.531763 -0.7543274
Diabetes           3.344971  0.5427094
SUNLIGHT           2.974585  0.3380277
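
A minimal base-R sketch of how such a table can be produced, assuming moment-based definitions in which a normal distribution has kurtosis of about 3 (consistent with the values above, although the source does not state which function was used). The helper names and the simulated data frame `demo` are hypothetical.

```r
# Moment-based skewness and (non-excess) kurtosis; a normal gives about 0 and 3
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}
kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

# Hypothetical data: one symmetric and one right-skewed variable
set.seed(123)
demo <- data.frame(symmetric = rnorm(5000), right_skewed = rexp(5000))

# One row per variable, sorted by kurtosis (largest first)
shape <- data.frame(
  Kurtosis  = vapply(demo, kurtosis, numeric(1)),
  Skewness  = vapply(demo, skewness, numeric(1)),
  row.names = names(demo)
)
shape <- shape[order(-shape$Kurtosis), ]
```

Variables at the top of such a table are the natural candidates for a transformation before PCA.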

Check Assumptions: Outliers Analysis (continued)

Box-Cox lambda estimated for the three most skewed variables:
                 Lambda
Lead_TPY            0.1
Mercury_TPY        -0.1
Atrazine_High_KG    0.1
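
The source does not show how these lambda values were estimated. One common approach, sketched here under that assumption, is a grid search that maximizes the Box-Cox profile log-likelihood. The function names and the simulated variable `y` are hypothetical. Box-Cox requires strictly positive values, so a variable containing zeros would first need a shift, which this sketch does not attempt.

```r
# Box-Cox transform: log(y) when lambda = 0, otherwise (y^lambda - 1) / lambda
boxcox_transform <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood of lambda (up to an additive constant)
boxcox_loglik <- function(y, lambda) {
  z <- boxcox_transform(y, lambda)
  n <- length(y)
  -n / 2 * log(mean((z - mean(z))^2)) + (lambda - 1) * sum(log(y))
}

# Hypothetical right-skewed, strictly positive variable
set.seed(2024)
y <- rlnorm(1000, meanlog = 0, sdlog = 1)

# Grid search over lambda in steps of 0.1
grid <- seq(-2, 2, by = 0.1)
loglik <- vapply(grid, function(l) boxcox_loglik(y, l), numeric(1))
lambda_hat <- grid[which.max(loglik)]

y_transformed <- boxcox_transform(y, lambda_hat)
```

For log-normal data like this, the estimate should land near 0, which corresponds to a log transform.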

Correlation Analysis

  • While not a complete diagnosis, identifying highly correlated variables can indicate multicollinearity within the dataset (VIF is examined later).
  • This provides insight into the effectiveness of PCA, because correlated variables can be transformed into orthogonal components, which eliminates multicollinearity.
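
The second point can be sketched with the built-in `USArrests` data (a stand-in, not the dataset from this application): the original variables are correlated, while the PC scores returned by `prcomp()` are uncorrelated with each other.

```r
# Stand-in data: built-in USArrests (not the dataset used in the application)
R <- cor(USArrests)
max_raw_cor <- max(abs(R[upper.tri(R)]))  # strongest pairwise correlation

# PCA on standardized variables
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Correlations between PC scores: off-diagonal entries are numerically zero
score_cor <- cor(pca$x)
max_score_cor <- max(abs(score_cor[upper.tri(score_cor)]))
```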

References

Amin, R. W., E. M. Yacko, and R. P. Guttmann. 2018. “Geographic Clusters of Alzheimer’s Disease Mortality Rates in the USA: 2008-2012.” Journal of Prevention of Alzheimer’s Disease (JPAD) 3.
“Compression of Spectral Data Using Box-Cox Transformation.” 2014. Color Research & Application 39 (2). https://doi.org/10.1002/col.21771.
Harvey, A., and P. Collier. 1977. “Testing for Functional Misspecification in Regression Analysis.” Journal of Econometrics 6: 103–19.
Johnson, Richard, and Dean Wichern. 2023. Applied Multivariate Statistical Analysis. Pearson.
Maureen, Nwakuya Tobechukwu, Biu Emmanuel Oyinebifun, and Ekwe Christopher. 2022. “Investigating Instability of Regression Parameters and Structural Breaks in Nigerian Economic Data from 1984 to 2019.” International Journal of Mathematics Trends and Technology 68 (12): 67–73. https://doi.org/10.14445/22315373/IJMTT-V68I12P509.
Tejada-Vera, Betzaida. 2013. “Mortality from Alzheimer’s Disease in the United States: Data for 2000 and 2010.” NCHS Data Brief, no. 116: 1–8.